Abstract

A good decoding algorithm is critical to the success of any statistical
machine translation system. The decoder’s job is to find the translation
that is most likely according to a set of previously learned parameters
(and a formula for combining them). Since the space of possible
translations is extremely large, typical decoding algorithms are only
able to examine a portion of it, and thus risk missing good solutions.
In this paper, we compare the speed and output quality of a traditional
stack-based decoding algorithm with two new decoders: a fast greedy
decoder and a slow but optimal decoder that treats decoding as an
integer-programming optimization problem.

1 Introduction

A statistical MT system that translates (say) French sentences into
English is divided into three parts: (1) a language model (LM) that
assigns a probability P(e) to any English string, (2) a translation
model (TM) that assigns a probability P(f|e) to any pair of English and
French strings, and (3) a decoder. The decoder takes a previously unseen
sentence f and tries to find the e that maximizes P(e|f), or
equivalently maximizes P(e) · P(f|e).
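As a minimal sketch of this objective, assuming two invented candidate strings and made-up probabilities (none of these numbers come from a trained model), the decoder's choice can be written as an argmax over log P(e) + log P(f|e):

```python
import math

# Hypothetical toy scores for a single French input f.
LM = {"it is not clear": 0.02, "it not is clear": 0.0001}   # P(e)
TM = {"it is not clear": 0.05, "it not is clear": 0.06}     # P(f | e)

def decode(candidates):
    """Return the candidate e that maximizes log P(e) + log P(f|e)."""
    return max(candidates, key=lambda e: math.log(LM[e]) + math.log(TM[e]))

best = decode(list(LM))
```

The second candidate has the higher translation-model score, but the language model penalizes its word order heavily enough that the first candidate wins.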

Brown et al. (1993) introduced a series of TMs based on word-for-word
substitution and re-ordering, but did not include a decoding algorithm.
If the source and target languages are constrained to have the same word
order (by choice or through suitable pre-processing), then the linear
Viterbi algorithm can be applied (Tillmann et al., 1997). If re-ordering
is limited to rotations around nodes in a binary tree, then optimal
decoding can be carried out by a high-polynomial algorithm (Wu, 1996).
For arbitrary word-reordering, the decoding problem is NP-complete
(Knight, 1999).

A sensible strategy (Brown et al., 1995; Wang and Waibel, 1997) is to
examine a large subset of likely decodings and choose just from that. Of
course, it is possible to miss a good translation this way. If the
decoder returns e' but there exists some e for which P(e|f) > P(e'|f),
this is called a search error. As Wang and Waibel (1997) remark, it is
hard to know whether a search error has occurred: the only way to show
that a decoding is sub-optimal is to actually produce a higher-scoring
one.

Thus, while decoding is a clear-cut optimization task in which every
problem instance has a right answer, it is hard to come up with good
answers quickly. This paper reports on measurements of speed, search
errors, and translation quality in the context of a traditional stack
decoder (Jelinek, 1969; Brown et al., 1995) and two new decoders. The
first is a fast greedy decoder, and the second is a slow optimal decoder
based on generic mathematical programming techniques.

2 IBM Model 4

In this paper, we work with IBM Model 4, which revolves around the
notion of a word alignment over a pair of sentences (see Figure 1). A
word alignment assigns a single home (English string position) to each
French word. If two French words align to the same English word, then
that English word is said to have a fertility of two. Likewise, if an
English word remains unaligned-to, then it has fertility zero. The word
alignment in Figure 1 is shorthand for a hypothetical stochastic process
by which an English string gets converted into a French string. There
are several sets of decisions to be made.

    it  is  not  clear  .

    |  \  |  \  \
    |  \  +  \  \
    |  \/  \  \  \
    |  /\  \  \  \

    CE  NE  EST  PAS  CLAIR

Figure 1: Sample word alignment.

First, every English word is assigned a fertility. These assignments are
made stochastically according to a table n(φ_i | e_i). We delete from
the string any word with fertility zero, we duplicate any word with
fertility two, etc. If a word has fertility greater than zero, we call
it fertile. If its fertility is greater than one, we call it very
fertile.
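The effect of the fertility step on the string can be sketched as follows. This is only an illustration: the fertilities are passed in directly rather than sampled from the table n(φ_i | e_i), and the function names are made up for this example.

```python
def apply_fertilities(words, fertilities):
    """Delete words with fertility 0 and repeat each other word phi times."""
    out = []
    for word, phi in zip(words, fertilities):
        out.extend([word] * phi)
    return out

def is_fertile(phi):
    return phi > 0

def is_very_fertile(phi):
    return phi > 1

# "not" is given fertility two, so it appears twice in the new string.
new_string = apply_fertilities(["it", "is", "not", "clear"], [1, 1, 2, 1])
```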

After each English word in the new string, we may increment the
fertility of an invisible English NULL element with probability p_1
(typically about 0.02). The NULL element ultimately produces “spurious”
French words.

Next, we perform a word-for-word replacement of English words (including
NULL) by French words, according to the table t(f_j | e_i).

Finally, we permute the French words. In permuting, Model 4
distinguishes between French words that are heads (the leftmost French
word generated from a particular English word), non-heads (non-leftmost,
generated only by very fertile English words), and NULL-generated.

Heads. The head of one English word is assigned a French string position
based on the position assigned to the previous English word. If an
English word e_{i-1} translates into something at French position j,
then the French head word of e_i is stochastically placed in French
position k with distortion probability d_1(k-j | class(e_{i-1}),
class(f_k)), where “class” refers to automatically determined word
classes for French and English vocabulary items. This relative offset
k-j encourages adjacent English words to translate into adjacent French
words. If e_{i-1} is infertile, then j is taken from e_{i-2}, etc. If
e_{i-1} is very fertile, then j is the average of the positions of its
French translations.
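The relative offset k-j used for head placement can be sketched as below. The sketch assumes a particular alignment for "it is not clear" (it→CE, is→EST, not→NE and PAS, clear→CLAIR), 1-based French positions, and the ceiling-of-the-average convention with 0 when no fertile word lies to the left; the function names are hypothetical.

```python
import math

def previous_center(french_positions, i):
    """Ceiling of the average French position of the nearest fertile English
    word to the left of word i, or 0 if there is none."""
    for p in range(i - 1, -1, -1):
        if french_positions[p]:          # skip infertile words
            pos = french_positions[p]
            return math.ceil(sum(pos) / len(pos))
    return 0

def head_offset(french_positions, i):
    """Offset k - j for the head (leftmost French word) of fertile word i."""
    return min(french_positions[i]) - previous_center(french_positions, i)

# french_positions[i] lists the French positions generated by English word i.
positions = [[1], [3], [2, 4], [5]]
offsets = [head_offset(positions, i) for i in range(len(positions))]
```

For the very fertile word "not", the center used by the following word is the ceiling of (2 + 4) / 2, i.e. 3, so the head of "clear" gets offset 5 - 3 = 2.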

Non-heads. If the head of English word e_i is placed in French position
j, then its first non-head is placed in French position k (> j)
according to another table d_{>1}(k-j | class(f_k)). The next non-head
is placed at position q with probability d_{>1}(q-k | class(f_q)), and
so forth.

NULL-generated. After heads and non-heads are placed, NULL-generated
words are permuted into the remaining vacant slots randomly. If there
are φ_0 NULL-generated words, then any placement scheme is chosen with
probability 1/φ_0!.

These stochastic decisions, starting with e, result in different choices
of f and an alignment of f with e. We map an e onto a particular <a,f>
pair with probability:


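\[
P(a, f \mid e) =
  \prod_{i=1}^{l} n(\phi_i \mid e_i)
  \times \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} t(\tau_{ik} \mid e_i)
  \times \prod_{i=1,\ \phi_i > 0}^{l} d_1(\pi_{i1} - c_{\rho_i} \mid \mathrm{class}(e_{\rho_i}), \mathrm{class}(\tau_{i1}))
  \times \prod_{i=1}^{l} \prod_{k=2}^{\phi_i} d_{>1}(\pi_{ik} - \pi_{i(k-1)} \mid \mathrm{class}(\tau_{ik}))
  \times \binom{m - \phi_0}{\phi_0} p_1^{\phi_0} (1 - p_1)^{m - 2\phi_0}
  \times \prod_{k=1}^{\phi_0} t(\tau_{0k} \mid \mathrm{NULL})
\]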
where the factors separated by × symbols denote fertility, translation,
head permutation, non-head permutation, null-fertility, and
null-translation probabilities.¹

3 Definition of the Problem

If we observe a new sentence f, then an optimal decoder will search for
an e that maximizes P(e|f) ∝ P(e) · P(f|e). Here, P(f|e) is the sum of
P(a,f|e) over all possible alignments a. Because this sum involves
significant computation, we typically avoid it by instead searching for
an <e,a> pair that maximizes P(e,a|f) ∝ P(e) · P(a,f|e). We take the
language model P(e) to be a smoothed n-gram model of English.

¹The symbols in this formula are: l (the length of e), m (the length of
f), e_i (the i-th English word in e), e_0 (the NULL word), φ_i (the
fertility of e_i), φ_0 (the fertility of the NULL word), τ_ik (the k-th
French word produced by e_i in a), π_ik (the position of τ_ik in f),
ρ_i (the position of the first fertile word to the left of e_i in a),
c_ρi (the ceiling of the average of all π_ρik for ρ_i, or 0 if ρ_i is
undefined).

4 Stack-Based Decoding

The stack (also called A*) decoding algorithm is a kind of best-first
search which was first introduced in the domain of speech recognition
(Jelinek, 1969). By building solutions incrementally and storing partial
solutions, or hypotheses, in a “stack” (in modern terminology, a
priority queue), the decoder conducts an ordered search of the solution
space. In the ideal case (unlimited stack size and exhaustive search
time), a stack decoder is guaranteed to find an optimal solution; our
hope is to do almost as well under real-world constraints of limited
space and time. The generic stack decoding algorithm follows:

•  Initialize the stack with an empty hypothesis.

•  Pop h, the best hypothesis, off the stack.

•  If h is a complete sentence, output h and terminate.

•  For each possible next word w, extend h by adding w and push the
   resulting hypothesis onto the stack.

•  Return to the second step (pop).
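The five steps above can be sketched with Python's standard heapq priority queue. This is a generic best-first loop under simplifying assumptions, not the decoder evaluated in this paper: hypotheses are tuples of words, costs stand in for negative log probabilities, and the toy vocabulary and the two-word completion test are invented for the example.

```python
import heapq

def stack_decode(extend, is_complete, max_pops=10000):
    """Best-first search: pop the cheapest hypothesis, stop if it is
    complete, otherwise push all of its one-word extensions."""
    stack = [(0.0, ())]                      # (cost, hypothesis); empty start
    for _ in range(max_pops):
        if not stack:
            return None
        cost, h = heapq.heappop(stack)       # pop the best hypothesis
        if is_complete(h):
            return cost, h
        for new_cost, new_h in extend(cost, h):
            heapq.heappush(stack, (new_cost, new_h))
    return None                              # search budget exhausted

# Toy problem: each word has a fixed cost; a hypothesis is complete at 2 words.
WORD_COST = {"a": 1.0, "b": 0.5}

def extend(cost, h):
    return [(cost + c, h + (w,)) for w, c in WORD_COST.items()]

result = stack_decode(extend, lambda h: len(h) == 2)
```

Because every extension adds cost, the one-word hypothesis ("a",) is popped before the complete hypothesis ("b", "b") with the same cost; this is the bias toward short hypotheses discussed later in this section.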

One crucial difference between the decoding process in speech
recognition (SR) and machine translation (MT) is that speech is always
produced in the same order as its transcription. Consequently, in SR
decoding there is always a simple left-to-right correspondence between
input and output sequences. By contrast, in MT the left-to-right
relation rarely holds even for language pairs as similar as French and
English. We address this problem by building the solution from left to
right, but allowing the decoder to consume its input in any order. This
change makes decoding significantly more complex in MT; instead of
knowing the order of the input in advance, we must consider all n!
permutations of an n-word input sentence.

Another important difference between SR and MT decoding is the lack of
reliable heuristics in MT. A heuristic is used in A* search to estimate
the cost of completing a partial hypothesis. A good heuristic makes it
possible to accurately compare the value of different partial
hypotheses, and thus to focus the search in the most promising
direction. The left-to-right restriction in SR makes it possible to use
a simple yet reliable class of heuristics which estimate cost based on
the amount of input left to decode. Partly because of the absence of
left-to-right correspondence, MT heuristics are significantly more
difficult to develop (Wang and Waibel, 1997). Without a heuristic, a
classic stack decoder is ineffective because shorter hypotheses will
almost always look more attractive than longer ones: as we add words to
a hypothesis, we multiply more and more terms (each at most 1) into its
probability. Because of this, longer hypotheses will be pushed off the
end of the stack by shorter ones even if they are in reality better
decodings. Fortunately, by using more than one stack, we can eliminate
this effect.

In a multistack decoder, we employ more than one stack to force
hypotheses to compete fairly. More specifically, we have one stack for
each subset of input words. This way, a hypothesis can only be pruned if
there are other, better, hypotheses that represent the same portion of
the input. With more than one stack, however, how does a multistack
decoder choose which hypothesis to extend during each iteration? We
address this issue by simply taking one hypothesis from each stack, but
a better solution would be to somehow compare hypotheses from different
stacks and extend only the best ones.
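One way to picture the bookkeeping is a dictionary of stacks keyed by the set of covered input positions. The sketch below is an assumption-laden illustration (the beam size, the cost values, and the helper names push and best_per_stack are all invented), not the data structure used in our implementation.

```python
from collections import defaultdict

def push(stacks, covered, cost, hyp, beam_size=2):
    """Add a hypothesis to the stack for its covered input subset, keeping
    only the beam_size cheapest hypotheses in that stack."""
    stack = stacks[frozenset(covered)]
    stack.append((cost, hyp))
    stack.sort()
    del stack[beam_size:]        # prune only against the same subset

def best_per_stack(stacks):
    """One hypothesis from each stack, to be extended in this iteration."""
    return {key: stack[0] for key, stack in stacks.items() if stack}

stacks = defaultdict(list)
push(stacks, {0}, 2.0, ("it",))
push(stacks, {0}, 1.5, ("this",))
push(stacks, {0}, 3.0, ("that",))          # pruned: worst of its own stack
push(stacks, {0, 2}, 9.0, ("it", "is"))    # kept: no competitor covers {0, 2}
```

The longer hypothesis survives despite its higher cost because it only competes with hypotheses that cover the same input words.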

The multistack decoder we describe is closely patterned on the Model 3
decoder described in the (Brown et al., 1995) patent. We build solutions
incrementally by applying operations to hypotheses. There are four
operations:

•  Add adds a new English word and aligns a single French word to it.

•  AddZfert adds two new English words. The first has fertility zero,
   while the second is aligned to a single French word.

•  Extend aligns an additional French word to the most recent English
   word, increasing its fertility.

•  AddNull aligns a French word to the English NULL element.

AddZfert is by far the most expensive operation, as we must consider
inserting a zero-fertility English word before each translation of each
unaligned French word. With an English vocabulary size of 40,000,
AddZfert is 400,000 times more expensive than AddNull!

We can reduce the cost of AddZfert in two ways. First, we can consider
only certain English words as candidates for zero-fertility, namely
words which both occur frequently and have a high probability of being
assigned fertility zero. Second, we can only insert a zero-fertility
word if it will increase the probability of a hypothesis. According to
the definition of the decoding problem, a zero-fertility English word
can only make a decoding more likely by increasing P(e) more than it
decreases P(a,f|e).² By only considering helpful zero-fertility
insertions, we save ourselves significant overhead in the AddZfert
operation, in many cases eliminating all possibilities and reducing its
cost to less than that of AddNull.
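The second filter amounts to a sign test in log space. The sketch below assumes the only translation-model change from the insertion is the extra factor n(0 | w); the function name and the example probabilities are hypothetical.

```python
import math

def zero_fertility_insertion_helps(lm_before, lm_after, n0):
    """True if inserting a zero-fertility word w raises P(e) * P(a,f|e).

    lm_before, lm_after: P(e) without and with w inserted.
    n0: n(0 | w), the extra factor (< 1) multiplied into P(a,f|e).
    """
    gain = math.log(lm_after) - math.log(lm_before)   # change in log P(e)
    loss = math.log(n0)                               # change in log P(a,f|e)
    return gain + loss > 0

helpful = zero_fertility_insertion_helps(1e-6, 1e-5, 0.5)     # 10x LM gain
unhelpful = zero_fertility_insertion_helps(1e-6, 1.5e-6, 0.5) # 1.5x LM gain
```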

5 Greedy Decoding

Over the last decade, many instances of NP-complete problems have been
shown to be solvable in reasonable/polynomial time using greedy methods
(Selman et al., 1992; Monasson et al., 1999). Instead of deeply probing
the search space, such greedy methods typically start out with a random,
approximate solution and then try to improve it incrementally until a
satisfactory solution is reached. In many cases, greedy methods quickly
yield surprisingly good solutions.

We conjectured that such greedy methods may prove to be helpful in the
context of MT decoding. The greedy decoder that we describe starts the
translation process from an English gloss of the French sentence given
as input. The gloss is constructed by aligning each French word f_j with
its most likely English translation e_{f_j} (e_{f_j} = argmax_e
t(e | f_j)). For example, in translating the French sentence “Bien
entendu , il parle de une belle victoire .”, the greedy decoder
initially assumes that a good translation of it is “Well heard , it
talking a beautiful victory” because the best translation of “bien” is
“well”, the best translation of “entendu” is “heard”, and so on. The
alignment corresponding to this translation is shown at the top of
Figure 2.

²We know that adding a zero-fertility word will decrease P(a,f|e)
because it adds a term n(0 | e_i) < 1 to the calculation.
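The gloss construction can be sketched as a per-word argmax. The table below is hypothetical: only the winners ("well" for "bien", "heard" for "entendu") follow the example in the text, while the probabilities and the competing entries are invented.

```python
# Hypothetical inverse translation table t(e | f).
T_INVERSE = {
    "bien": {"well": 0.6, "quite": 0.3},
    "entendu": {"heard": 0.5, "understood": 0.4},
}

def gloss(french_words, t_inverse):
    """Pick, for each French word, its single most likely English word."""
    return [max(t_inverse[f], key=t_inverse[f].get) for f in french_words]

initial = gloss(["bien", "entendu"], T_INVERSE)
```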

Once the initial alignment is created, the greedy decoder tries to
improve it, i.e., tries to find an alignment (and implicitly
translation) of higher probability, by applying one of the following
operations:

•  translateOneOrTwoWords(j1, e1, j2, e2) changes the translation of one
   or two French words, those located at positions j1 and j2, from
   e_{f_j1} and e_{f_j2} into e1 and e2. If e_{f_j} is a word of
   fertility 1 and e_k is NULL, then e_{f_j} is deleted from the
   translation. If e_{f_j} is the NULL word, the word e_k is inserted
   into the translation at the position that yields the alignment of
   highest probability. If e_{f_j1} = e1 or e_{f_j2} = e2, this
   operation amounts to changing the translation of a single word.

•  translateAndInsert(j, e1, e2) changes the translation of the French
   word located at position j from e_{f_j} into e1 and simultaneously
   inserts word e2 at the position that yields the alignment of highest
   probability. Word e2 is selected from an automatically derived list
   of 1024 words with high probability of having fertility 0. When
   e_{f_j} = e1, this operation amounts to inserting a word of fertility
   0 into the alignment.

•  removeWordOfFertility0(i) deletes the word of fertility 0 at position
   i in the current alignment.

•  swapSegments(i1, i2, j1, j2) creates a new alignment from the old one
   by swapping non-overlapping English word segments [i1, i2] and
   [j1, j2]. During the swap operation, all existing links between
   English and French words are preserved. The segments can be as small
   as a word or as long as |e| - 1 words, where |e| is the length of the
   English sentence.

•  joinWords(i1, i2) eliminates from the alignment the English word at
   position i1 (or i2) and links the French words generated by e_{i1}
   (or e_{i2}) to e_{i2} (or e_{i1}).
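Of the operations listed above, the segment swap is the easiest to show in isolation. The sketch below works on a bare word list with 0-based, inclusive indices (both assumptions made for this example) and does not model the alignment links that the real operation preserves.

```python
def swap_segments(words, i1, i2, j1, j2):
    """Swap the non-overlapping segments words[i1..i2] and words[j1..j2]
    (inclusive bounds, first segment to the left of the second)."""
    if not (0 <= i1 <= i2 < j1 <= j2 < len(words)):
        raise ValueError("segments must be in range, ordered, and disjoint")
    return (words[:i1] + words[j1:j2 + 1] + words[i2 + 1:j1]
            + words[i1:i2 + 1] + words[j2 + 1:])

swapped = swap_segments(["a", "b", "c", "d", "e", "f"], 0, 1, 3, 3)
```

Words between the two segments ("c" here) keep their place between the swapped segments.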


    NULL well heard , it talking a beautiful victory
    bien entendu , il parle de une belle victoire .

        translateTwoWords(5, talks, ?, great)

    NULL well heard , it talks a great victory
    bien entendu , il parle de une belle victoire .

        translateTwoWords(2, understood, 0, about)

    NULL well understood , it talks about a great victory .
    bien entendu , il parle de une belle victoire .

        translateOneWord(4, he)

    NULL well understood , he talks about a great victory .
    bien entendu , il parle de une belle victoire .

        translateTwoWords(1, quite, 2, naturally)

    NULL quite naturally , he talks about a great victory .
    bien entendu , il parle de une belle victoire .

Figure 2: Example of how the greedy decoder produces the translation of
the French sentence “Bien entendu, il parle de une belle victoire.”

In a stepwise fashion, starting from the initial gloss, the greedy
decoder iterates exhaustively over all alignments that are one operation
away from the alignment under consideration. At every step, the decoder
chooses the alignment of highest probability, until the probability of
the current alignment can no longer be improved. When it starts from the
gloss of the French sentence “Bien entendu, il parle de une belle
victoire.”, for example, the greedy decoder alters the initial alignment
incrementally as shown in Figure 2, eventually producing the translation
“Quite naturally, he talks about a great victory.”. In the process, the
decoder explores a total of 77421 distinct alignments/translations, of
which “Quite naturally, he talks about a great victory.” has the highest
probability.

We chose the operation types enumerated above for two reasons: (i) they
are general enough to enable the decoder to escape local maxima and
modify a given alignment in a non-trivial manner in order to produce
good translations; (ii) they are relatively inexpensive (timewise). The
most time-consuming operations in the decoder are swapSegments,
translateOneOrTwoWords, and translateAndInsert. SwapSegments iterates
over all possible non-overlapping span pairs that can be built on a
sequence of length |e|. TranslateOneOrTwoWords iterates over
|f|^2 × |t|^2 alignments, where |f| is the size of the French sentence
and |t| is the number of translations we associate with each word (in
our implementation, we limit this number to the top 10 translations).
TranslateAndInsert iterates over |f| × |t| × |z| alignments, where |z|
is the size of the list of words with high probability of having
fertility 0 (1024 words in our implementation).
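The stepwise search described in this section is a hill-climbing loop: score every alignment one operation away, move to the best one, and stop when nothing improves. The sketch below shows only that control flow; the neighbors and score callables are placeholders, and the integer toy problem stands in for real alignments and Model 4 probabilities.

```python
def greedy_decode(start, neighbors, score):
    """Repeatedly move to the best-scoring neighbor until none improves."""
    current, current_score = start, score(start)
    while True:
        best = max(neighbors(current), key=score, default=None)
        if best is None or score(best) <= current_score:
            return current               # local maximum reached
        current, current_score = best, score(best)

# Toy stand-in: states are integers, each "operation" changes the state
# by one, and the score peaks at 3.
final = greedy_decode(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
```

Nothing in this loop guarantees a global optimum; how often it gets stuck depends on how rich the set of operations is.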


6 Integer Programming Decoding

Knight (1999) likens MT decoding to finding optimal tours in the
Traveling Salesman Problem (Garey and Johnson, 1979): choosing a good
word order for decoder output is similar to choosing a good TSP tour.
Because any TSP problem instance can be transformed into a decoding
problem instance, Model 4 decoding is provably NP-complete in the length
of f. It is interesting to consider the reverse direction: is it
possible to transform a decoding problem instance into a TSP instance?
If so, we may take great advantage of previous research into efficient
TSP algorithms. We may also take advantage of existing software
packages, obtaining a sophisticated decoder with little programming
effort.

It is difficult to convert decoding into straight TSP, but a wide range
of combinatorial optimization problems (including TSP) can be expressed
in the more general framework of linear integer programming. A sample
integer program (IP) looks like this:

    minimize objective function:
        3.2 * x1 + 4.7 * x2 - 2.1 * x3
    subject to constraints:
        x1 - 2.6 * x3 > 5
        7.3 * x2 > 7
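For a program this small, the optimum can be found by plain enumeration. The sketch below adds two assumptions that the sample does not state: the variables are non-negative, and each is at most 10 (the `bound` parameter). Without a lower bound on the variables the sample program would be unbounded.

```python
from itertools import product

def solve_sample_ip(bound=10):
    """Enumerate integer assignments 0..bound and keep the cheapest
    one that satisfies both constraints of the sample program."""
    best = None
    for x1, x2, x3 in product(range(bound + 1), repeat=3):
        if x1 - 2.6 * x3 > 5 and 7.3 * x2 > 7:
            value = 3.2 * x1 + 4.7 * x2 - 2.1 * x3
            if best is None or value < best[0]:
                best = (value, (x1, x2, x3))
    return best

best_value, best_assignment = solve_sample_ip()
```

Under these assumptions the minimum is reached at x1 = 6, x2 = 1, x3 = 0. Enumeration is exponential in the number of variables, which is why a dedicated solver is used for real instances.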

A solution to an IP is an assignment of integer values to variables.
Solutions are constrained by inequalities involving linear combinations
of variables. An optimal solution is one that respects the constraints
and minimizes the value of the objective function, which is also a
linear combination of variables. We can solve IP instances with generic
problem-solving software such as lp_solve or CPLEX.³ In this section we
explain how to express MT decoding (Model 4 plus English bigrams) in IP
format.

³Available at ftp://ftp.ics.ele.tue.nl/pub/lp_solve and
http://www.cplex.com.

Figure 3: A salesman graph for the input sentence f = “CE NE EST PAS
CLAIR .” There is one city for each word in f. City boundaries are
marked with bold lines, and hotels are illustrated with rectangles. A
tour of cities is a sequence of hotels (starting at the sentence
boundary hotel) that visits each city exactly once before returning to
the start.

We first create a salesman graph like the one in Figure 3. To do this,
we set up a city for each word in the observed sentence f. City
boundaries are shown with bold lines. We populate each city with ten
hotels corresponding to ten likely English word translations. Hotels are
shown as small rectangles. The owner of a hotel is the English word
inside the rectangle. If two cities have hotels with the same owner x,
then we build a third x-owned hotel on the border of the two cities.
More generally, if n cities all have hotels owned by x, we build
2^n - n - 1 new hotels (one for each non-empty, non-singleton subset of
the cities) on various city borders and intersections. Finally, we add
an extra city representing the sentence boundary.
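The count 2^n - n - 1 can be checked by listing the subsets directly. In this sketch, cities are represented by the French words they stand for, and the function name is made up for the example.

```python
from itertools import combinations

def border_hotels(cities):
    """One new hotel for every subset of at least two cities that all
    contain a hotel with the same owner."""
    return [frozenset(subset)
            for size in range(2, len(cities) + 1)
            for subset in combinations(cities, size)]

# Two cities sharing an owner give one border hotel; three give four.
two = border_hotels(["NE", "PAS"])
three = border_hotels(["CE", "NE", "PAS"])
```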

We define a tour of cities as a sequence of hotels (starting at the
sentence boundary hotel) that visits each city exactly once before
returning to the start. If a hotel sits on the border between two
cities, then staying at that hotel counts as visiting both cities. We
can view each tour of cities as corresponding to a potential decoding
<e,a>. The owners of the hotels on the tour give us e, while the hotel
locations yield a.

The next task is to establish real-valued (asymmetric) distances between
pairs of hotels, such that the length of any tour is exactly the
negative of log(P(e) · P(a,f|e)). Because log is monotonic, the shortest
tour will correspond to the likeliest decoding.

The distance we assign to each pair of hotels consists of some small
piece of the Model 4 formula. The usual case is typified by the large
black arrow in Figure 3. Because the destination hotel “not” sits on the
border between cities NE and PAS, it corresponds to a partial alignment
in which the word “not” has fertility two:

      . . .  what   not  . . .
              /    _/\_
             /    /    \
            CE   NE EST PAS CLAIR .

If we assume that we have already paid the price for visiting the “what”
hotel, then our inter-hotel distance need only account for the partial
alignment concerning “not”:

    distance =
        - log(bigram(not | what))
        - log(n(2 | not))
        - log(t(NE | not)) - log(t(PAS | not))
        - log(d_1(+1 | class(what), class(NE)))
        - log(d_{>1}(+2 | class(PAS)))
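Numerically, the distance is just a sum of negative logs, so exponentiating its negation recovers the product of the factors. The probabilities in this sketch are hypothetical placeholders chosen to make the arithmetic easy to follow; only the factor names come from the formula above.

```python
import math

def hotel_distance(factors):
    """Sum of negative log probabilities of the given model factors."""
    return -sum(math.log(p) for p in factors.values())

# Hypothetical probabilities, one per term of the distance.
factors = {
    "bigram(not | what)": 0.1,
    "n(2 | not)": 0.2,
    "t(NE | not)": 0.5,
    "t(PAS | not)": 0.5,
    "d_1(+1 | class(what), class(NE))": 0.4,
    "d_>1(+2 | class(PAS))": 0.25,
}
d = hotel_distance(factors)
```

With these values the product of the factors is 0.0005, so d equals -log(0.0005), roughly 7.6.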

NULL-owned hotels are treated specially. We require that all non-NULL
hotels be visited before any NULL hotels, and we further require that at
most one NULL hotel be visited on a tour. Moreover, the NULL fertility
sub-formula is easy to compute if we allow only one NULL hotel to be
visited: φ_0 is simply the number of cities that hotel straddles, and m
is the number of cities minus one. This case is typified by the large
gray arrow shown in Figure 3.

Between hotels that are located (even partially) in the same city, we
assign an infinite distance in both directions, as travel from one to
the other can never be part of a tour. For 6-word French sentences, we
normally come up with a graph that has about 80 hotels and 3500
finite-cost travel segments.

The next step is to cast tour selection as an integer program. Here we
adapt a subtour elimination strategy used in standard TSP. We create a
binary (0/1) integer variable x_ij for each pair of hotels i and j.
x_ij = 1 if and only if travel from hotel i to hotel j is on the
itinerary. The objective function is straightforward:

\[
\text{minimize:} \quad \sum_{(i,j)} x_{ij} \cdot \mathrm{distance}(i, j)
\]

This minimization is subject to three classes of constraints. First,
every city must be visited exactly once. That means exactly one tour
segment must exit each city:

\[
\forall c \in \text{cities}: \quad
  \sum_{i \text{ located at least partially in } c} \ \sum_{j} x_{ij} = 1
\]

Second, the segments must be linked to one another, i.e., every hotel
has either (a) one tour segment coming in and one going out, or (b) no
segments in and none out. To put it another way, every hotel must have
an equal number of tour segments going in and out:

\[
\forall i: \quad \sum_{j} x_{ij} = \sum_{j} x_{ji}
\]
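The first two constraint classes can be checked mechanically for a candidate itinerary. The sketch below assumes a small hand-built hotel graph (the hotel names, the "BOUNDARY" city, and the extra single-city hotel "not_NE" are invented for illustration) and represents the variables x_ij implicitly as the set of travel segments with value 1.

```python
def tour_edges(tour):
    """Travel segments (i, j) with x_ij = 1 for a cyclic hotel sequence."""
    return {(tour[k], tour[(k + 1) % len(tour)]) for k in range(len(tour))}

def exits_each_city_once(edges, hotel_cities, cities):
    """First class: exactly one segment leaves the hotels of each city."""
    return all(sum(1 for i, _j in edges if c in hotel_cities[i]) == 1
               for c in cities)

def flow_balanced(edges, hotel_cities):
    """Second class: every hotel has as many segments in as out."""
    return all(sum(1 for i, _j in edges if i == h)
               == sum(1 for _i, j in edges if j == h)
               for h in hotel_cities)

HOTEL_CITIES = {"<s>": {"BOUNDARY"}, "it": {"CE"}, "is": {"EST"},
                "not": {"NE", "PAS"}, "clear": {"CLAIR"}, "not_NE": {"NE"}}
CITIES = {"BOUNDARY", "CE", "NE", "EST", "PAS", "CLAIR"}

good = tour_edges(["<s>", "it", "is", "not", "clear"])
bad = tour_edges(["<s>", "it", "is", "not", "not_NE", "clear"])
```

The second itinerary is balanced at every hotel but leaves city NE twice, so it violates the first class of constraints.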


Third, it is necessary to prevent multiple independent sub-tours. To do
this, we require that every proper subset of cities have at least one
tour segment leaving it:

\[
\forall s \subset \text{cities}: \quad
  \sum_{i \text{ located entirely within } s} \
  \sum_{j \text{ located at least partially outside } s} x_{ij} \ge 1
\]

There are an exponential number of constraints in this third class.

Finally, we invoke our IP solver. If we assign mnemonic names to the
variables, we can easily extract <e,a> from the list of variables and
their binary values. The shortest tour for the graph in Figure 3
corresponds to this optimal decoding:

    it is not clear .

We can obtain the second-best decoding by adding a new constraint to the
IP to stop it from choosing the same solution again.⁴

⁴If we simply replace “minimize” with “maximize,” we can obtain the
longest tour, which corresponds to the worst decoding!

7 Experiments and Discussion

In our experiments we used a test collection of 505 sentences, uniformly
distributed across the lengths 6, 8, 10, 15, and 20. We evaluated all
decoders with respect to (1) speed, (2) search optimality, and (3)
translation accuracy. The last two factors may not always coincide, as
Model 4 is an imperfect model of the translation process; i.e., there is
no guarantee that a numerically optimal decoding is actually a good
translation.

Suppose a decoder outputs e', while the optimal decoding turns out to be
e. Then we consider six possible outcomes:

•  no error (NE): e' = e, and e' is a perfect translation.

•  pure model error (PME): e' = e, but e' is not a perfect translation.

•  deadly search error (DSE): e' ≠ e, and while e is a perfect
   translation, e' is not.

•  fortuitous search error (FSE): e' ≠ e, and e' is a perfect
   translation, while e is not.

•  harmless search error (HSE): e' ≠ e, but e' and e are both perfectly
   good translations.

•  compound error (CE): e' ≠ e, and neither is a perfect translation.

Here, “perfect” refers to a human-judged translation that transmits all
of the meaning of the source sentence using flawless target-language
syntax.
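The six outcomes form a simple decision table over three facts: whether the decoder output equals the optimal decoding, and whether each of the two is a perfect translation. A sketch, in which the `is_perfect` callable stands in for the human judgment and the function name is invented:

```python
def classify(decoder_output, optimal, is_perfect):
    """Return the outcome label for one test sentence."""
    out_ok, opt_ok = is_perfect(decoder_output), is_perfect(optimal)
    if decoder_output == optimal:
        return "NE" if out_ok else "PME"
    if opt_ok and not out_ok:
        return "DSE"      # search missed a perfect optimal decoding
    if out_ok and not opt_ok:
        return "FSE"      # search error happened to yield a perfect output
    if out_ok and opt_ok:
        return "HSE"
    return "CE"

perfect = {"it is not clear .", "this is not clear ."}
label = classify("it not clear .", "it is not clear .", perfect.__contains__)
```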

We have found it very useful to have several decoders on hand. It is
only through IP decoder output, for example, that we can know the stack
decoder is returning optimal solutions for so many sentences (see Table
1). The IP and stack decoders enabled us to quickly locate bugs in the
greedy decoder, and to implement extensions to the basic greedy search
that can find better solutions. (We came up with the greedy operations
discussed in Section 5 by carefully analyzing error logs of the kind
shown in Table 1). The results in Table 1 also enable us to prioritize
the items on our research agenda. Since the majority of the translation
errors can be attributed to the language and translation models we use
(see column PME in Table 1), it is clear that significant improvement in
translation quality will come from better models.

    sent    decoder  time        search  translation errors
    length  type     (sec/sent)  errors  (semantic and/or    NE  PME  DSE  FSE  HSE  CE
                                         syntactic)
    6       IP        47.50       0      57                  44  57    0    0    0    0
    6       stack      0.79       5      58                  43  53    1    0    0    4
    6       greedy     0.07      18      60                  38  45    5    2    1   10
    8       IP       499.00       0      76                  27  74    0    0    0    0
    8       stack      5.67      20      75                  24  57    1    2    2   15
    8       greedy     2.66      43      75                  20  38    4    5    1   33

Table 1: Comparison of decoders on sets of 101 test sentences. All
experiments in this table use a bigram language model.

    sent    decoder  time        translation errors
    length  type     (sec/sent)  (semantic and/or syntactic)
    6       stack      13.72     42
    6       greedy      1.58     46
    6       greedy*     0.07     46
    8       stack      45.45     59
    8       greedy      2.75     68
    8       greedy*     0.15     69
    10      stack     105.15     57
    10      greedy      3.83     63
    10      greedy*     0.20     68
    15      stack     >2000      74
    15      greedy     12.06     75
    15      greedy*     1.11     75
    15      greedy^     0.63     76
    20      greedy     49.23     86
    20      greedy*    11.34     93
    20      greedy^     0.94     93

Table 2: Comparison between decoders using a trigram language model.
Greedy* and greedy^ are greedy decoders optimized for speed.
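The Table 1 figures can be cross-checked against the outcome definitions: the search-error count should equal the sum of the four search-error categories, and the six categories should add up to the 101 sentences per set. The sketch below transcribes the relevant columns of Table 1; the tuple layout and function name are choices made for this example.

```python
# (length, decoder, search_errors, NE, PME, DSE, fortuitous, HSE, CE)
TABLE_1 = [
    (6, "IP", 0, 44, 57, 0, 0, 0, 0),
    (6, "stack", 5, 43, 53, 1, 0, 0, 4),
    (6, "greedy", 18, 38, 45, 5, 2, 1, 10),
    (8, "IP", 0, 27, 74, 0, 0, 0, 0),
    (8, "stack", 20, 24, 57, 1, 2, 2, 15),
    (8, "greedy", 43, 20, 38, 4, 5, 1, 33),
]

def check_row(row):
    """Return (search errors match the category sum, total sentences)."""
    _length, _decoder, search, ne, pme, dse, fse, hse, ce = row
    return search == dse + fse + hse + ce, ne + pme + dse + fse + hse + ce

checks = [check_row(row) for row in TABLE_1]
```

Every row passes both checks, which suggests the per-category columns were transcribed correctly.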

The results in Table 2, obtained with decoders that use a trigram
language model, show that our greedy decoding algorithm is a viable
alternative to the traditional stack decoding algorithm. Even when the
greedy decoder uses an optimized-for-speed set of operations in which at
most one word is translated, moved, or inserted at a time and at most
3-word-long segments are swapped (labeled “greedy*” in Table 2), the
translation accuracy is affected only slightly. In contrast, the
translation speed increases by at least one order of magnitude.
Depending on the application of interest, one may choose to use a slow
decoder that provides optimal results or a fast, greedy decoder that
provides non-optimal, but acceptable results. One may also run the
greedy decoder using a time threshold, as an instance of an anytime
algorithm. When the threshold is set to one second per sentence (the
greedy^ label in Table 2), the performance is affected only slightly.